Abstract: Understanding network structure and having access
to realistic graphs plays a central role in computer and social
networks research. In this paper, we propose a complete and
practical methodology for generating graphs that resemble a
real graph of interest. The metrics of the original topology we
aim to match are the joint degree distribution (JDD) and the
degree-dependent average clustering coefficient (c(k)). We start
by developing efficient estimators for these two metrics based 
on a node sample collected via either independence sampling or 
random walks. Then, we process the output of the estimators 
to ensure that the target properties are realizable. Finally, we 
propose an efficient algorithm for generating topologies that have 
the exact target JDD and a c(k) close to the target. Extensive 
simulations using real-life graphs show that the graphs generated 
by our methodology are similar to the original graph with respect 
to not only the two target metrics but also a wide range of
other topological metrics; furthermore, our generator is orders of
magnitude faster than state-of-the-art techniques.

I. Introduction 

Understanding network structure and having access to realis- 
tic graphs plays a central role in computer and social networks 
research. To this end, researchers use various approaches: they 
collect real measurements, often involving sampling; use the 
datasets for a particular study and/or make them publicly avail- 
able; and develop analytical models that can generate topologies 
with key properties resembling those of the network of interest. 
Each of these approaches is a challenging research question
on its own, involving complex tradeoffs between accurate
representation of the original graph and practical constraints in
terms of measurement overhead, algorithm complexity, and privacy.

For example, the popularity of online social networks (OSNs) 
has given rise to a number of measurement studies to improve 
the understanding of their characteristics. Being able to charac- 
terize and simulate the topology of the social graph is important 
for evaluating the effectiveness of a growing number of social 
network applications that attempt to leverage the social graph. 
The commonly used approach is to measure these networks and
make the properly anonymized dataset available. Given the
size of most of these networks, they are typically not measured
in their entirety; instead, sampling is used to estimate properties
of interest. Another approach is to develop models that
generate graphs with certain properties of interest, such
as node degree distribution, clustering coefficient, community
structure, and diameter.

So far, network sampling and topology generation have been 
looked at separately. We believe there is a need for a complete 
methodology that starts by sampling a real (yet not fully known) 
graph, estimates properties of interest, and generates synthetic 
graphs that resemble the original in a number of important prop- 
erties; the methodology should also meet practical constraints 
such as sampling budget and computational complexity. 



There is, of course, a plethora of metrics one could be 
interested in when analyzing and generating graphs. Ideally, one 
would like to generate synthetic graphs that resemble the orig- 
inal in as many topological properties as possible. In practice, 
there are two main limitations. First, the topological properties 
of interest should be estimated based on a sample of the real 
graph; the more involved the properties, the larger sample is 
needed. Second, constructing a graph with given properties 
becomes more difficult for more restrictive properties. 

Given these practical limitations in estimation and graph 
generation, we focus on the following two metrics: joint degree 
distribution, JDD, and degree-dependent average clustering co- 
efficient, c(k). We develop a complete, practical methodology 
for generating graphs that resemble a real graph of interest in 
terms of these two metrics. 

We follow the systematic framework and terminology of dK-series
[11], which characterizes the properties of a graph using
series of probability distributions specifying all degree correlations
within d-sized subgraphs of a given graph G. Increasing
values of d capture progressively more properties of G at the
cost of a more complex representation of the probability distribution.
We refer to the graphs generated by our approach as
2.5K-graphs because they meet more specified properties than
2K-graphs (i.e., graphs with a target joint degree distribution)
but fewer than 3K-graphs (graphs with specified distributions of
all subgraphs of three nodes). We show that 2.5K provides
a sweet spot between accurate representation and practical
constraints. The key insight is that some information about
clustering is necessary for a realistic representation of real-life
graphs, especially OSNs, while c(k) is still practical to
estimate and generate. More specifically, we make the following
contributions in estimation and generation of 2.5K.

Estimation: We derive efficient estimators of the metrics
of interest, namely JDD and c(k), based on a node sample
collected either via an independence sampler or via a random
walk; the latter is the common practice in sampling OSNs.
Our design utilizes edges induced between sampled nodes, and
appropriately corrects for biases introduced by the random walk
resulting from (i) non-uniform sampling weights, and (ii) strong
dependencies between successive samples. We demonstrate the
efficiency of the estimators via simulation. In addition, we
post-process these metrics to ensure that they are realizable, i.e., that
there exist graphs with those properties.

Generation: We propose a practical algorithm for generating
2.5K-graphs that follow the target JDD exactly and c(k) very
closely. Our algorithm starts by generating a graph with exactly
the target JDD and more triangles than needed; it then performs
double edge swaps trying to meet the desired c(k). The key intuition,
and novelty compared to prior approaches, is that destroying
triangles is much easier than creating new ones. Extensive
simulations of real graphs show several strengths of our approach.
First, the 2.5K synthetic graph G' is similar to the original G
not only with respect to the targeted metrics, but also to a wide
range of other topological properties. Second, our generation
algorithm is orders of magnitude faster than prior approaches;
in fact, the latter may not converge in practice.

Fig. 1. Overview of our approach: Sampling, Estimation, Post-processing, Graph Generation. We start by sampling a large real (not fully known) graph (e.g.,
Facebook) and we end by constructing synthetic graphs that are similar to the original, in terms of the 2.5K metrics (JDD and clustering).

The rest of the paper is structured as follows. Section
II summarizes terminology and the problem statement. Section
III presents related work. Section IV discusses the first part of the
problem: network sampling, estimation of the 2.5K properties,
and post-processing. Section V discusses the second part of the
problem: a construction algorithm for 2.5K-graphs. Section VI
presents evaluation results on a wide range of fully known real-life
graphs. Section VII concludes the paper.

II. Notation and Problem Statement
We consider an undirected, static graph $G = (V, E)$, with
$|V|$ nodes and $|E|$ edges. For the purposes of random walks,
we also assume that $G$ is connected and aperiodic. For a node
$v \in V$, denote by $\deg(v)$ its degree, and by $\mathcal{N}(v) \subset V$ the
set of neighbors of $v$. Let also the number of shared partners
between nodes $a$ and $b$ be $\mathrm{sp}(a,b) = |\mathcal{N}(a) \cap \mathcal{N}(b)|$.

A. Graph Properties of Interest 

Joint Degree Distribution (JDD). A widely studied property 
of graphs is the degree distribution. In this paper, we are 
interested in more information captured by the joint node 
degree distribution, defined as the number (or frequency) of 
edges connecting nodes of degree $k$ with nodes of degree $l$:

$$\mathrm{JDD}(k,l) = \sum_{a \in V_k} \sum_{b \in V_l} 1_{\{\{a,b\} \in E\}}. \tag{1}$$

In other words, JDD quantifies a degree-dependent distribu- 
tion of subgraphs of 2 nodes. 
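
As an illustration of Eq. (1), the following minimal Python sketch computes the JDD of a fully known graph. The adjacency representation (`adj`, a dictionary from node to neighbor set), the function name, and the toy graph are assumptions made for this example only.

```python
from collections import Counter

def joint_degree_distribution(adj):
    """Return JDD as a Counter keyed by (k, l), following Eq. (1).

    `adj` maps every node to the set of its neighbors (undirected, simple).
    Eq. (1) sums over ordered pairs (a, b), so each edge is visited from both
    endpoints: JDD(k, l) == JDD(l, k), and an edge between two nodes of equal
    degree k adds 2 to JDD(k, k).
    """
    jdd = Counter()
    for a, neighbors in adj.items():
        for b in neighbors:
            jdd[(len(adj[a]), len(adj[b]))] += 1
    return jdd

# Toy graph (hypothetical): triangle 0-1-2 plus a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_jdd = joint_degree_distribution(toy)
```

With this convention the entries of the JDD sum to twice the number of edges, which is a convenient sanity check.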

Clustering. One of the most important topological properties, 
especially for OSNs, is clustering. The clustering coefficient 
$c_v$ of a node $v$ captures how close the neighbors of a node
are to forming a clique and is typically defined as the ratio
of the number of links between the neighbors to the
maximum number of such links. If two neighbors of a node are
connected, then these three nodes form a triangle, thus leading
to an equivalent definition [23]:

$$c_v = \frac{2\,T_v}{\deg(v)\,(\deg(v) - 1)}, \tag{2}$$

where $T_v$ is the number of triangles using $v$. At a slightly
coarser granularity, the degree-dependent average clustering
coefficient $c(k)$ is defined as

$$c(k) = \frac{1}{|V_k|} \sum_{v \in V_k} c_v, \tag{3}$$

where $V_k$ is the set of nodes of degree $k$.

Finally, $\bar{c}$, the clustering coefficient $c_v$ averaged over all
nodes in $G$, is defined as follows:

$$\bar{c} = \frac{1}{|V|} \sum_{v \in V} c_v. \tag{4}$$

Note that $c_v$ determines both $c(k)$ and $\bar{c}$, and $c(k)$ determines $\bar{c}$,
because $\bar{c} = \frac{1}{|V|} \sum_k |V_k| \cdot c(k)$. So, Eq. (4), Eq. (3) and Eq. (2)
impose increasingly restrictive constraints on clustering.
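
The three clustering metrics can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation; the function names and the toy graph are hypothetical, and we assume the common convention that a node of degree below 2 has clustering 0.

```python
from collections import defaultdict

def node_clustering(adj, v):
    """c_v = 2 T_v / (deg(v) (deg(v) - 1)); assumed 0 when deg(v) < 2."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    # Summing shared partners over the neighbors of v counts each triangle twice.
    twice_triangles = sum(len(adj[v] & adj[b]) for b in adj[v])
    return twice_triangles / (k * (k - 1))

def degree_dependent_clustering(adj):
    """c(k): average of c_v over the nodes of degree k."""
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(node_clustering(adj, v))
    return {k: sum(cs) / len(cs) for k, cs in by_degree.items()}

def average_clustering(adj):
    """Average of c_v over all nodes."""
    return sum(node_clustering(adj, v) for v in adj) / len(adj)

# Toy graph (hypothetical): triangle 0-1-2 plus a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
ck = degree_dependent_clustering(toy)
c_bar = average_clustering(toy)
```

On the toy graph, the two degree-2 nodes are fully clustered, the degree-3 node has one triangle out of three possible neighbor pairs, and the pendant node contributes 0.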

B. dK-series

In this paper, we follow and build on the systematic framework
of dK-series [11], which characterizes the properties of
a graph using series of probability distributions specifying all
degree correlations within d-sized subgraphs of a given graph
G. Essentially, dK-series extend the notion of JDD to any d-sized
subgraphs. To be more concrete:

• 0K specifies the average node degree.

• 1K specifies the node degree distribution.

• 2K specifies the joint degree distribution (JDD), Eq. (1).

• 3K specifies the degree-dependent distribution of subgraphs
of 3 nodes, i.e., the number of triangles and wedges
connecting nodes of degrees k, l, m.

• NK, where N = |V|, specifies the entire graph.

Clearly, increasing values of d capture progressively more
properties of G at the cost of a more complex representation
of the probability distribution. dK determines d'K for every
d' < d. The term "dK-graphs" refers to all graphs that have the
same d'K distributions for d' = 0, 1, ..., d.

C. 2.25K and 2.5K graphs 

According to [15], [16], 2K captures a number of key graph
properties, with the exception of clustering, which is inherent
in most OSNs. On the other hand, all the metrics of clustering
are completely determined by the 3K distributions. However,
we argue that 3K is not practical: it requires prohibitively many
samples to be estimated and it is difficult to generate in practice.
To address these problems, we introduce more practical notions
that, intuitively, lie between 2K and 3K:

• 2.5K specifies JDD and c(k): our proposed approach.

• 2.25K specifies JDD and the average clustering coefficient $\bar{c}$: a baseline for comparison.

D. Problem Statement and Approach 

Our objective is to provide a complete, practical methodology
for generating graphs that resemble a real graph of interest. We
use 2.5K as the modeling tool. The problem can be decomposed
into two parts:

• Estimation: Given a random walk sample of a real graph,
estimate JDD(k, l) and c(k).

• Graph Generation: Given desired (and realizable)
JDD(k, l) and c(k), construct a graph with those properties.

The steps of our approach are summarized in Fig. 1.

III. Related Work

Network Sampling and Estimation. Network measurement 
plays a central role in computer and social network research. 
In this paper we are mostly interested in the topology of 
online social networks, such as Facebook, Twitter and other 
blogging networks, LinkedIn, Instagram (to broadcast pictures),
and Epinions. Due to the large size of these networks, sampling
is typically used to estimate properties of interest. Recent
approaches, including our own prior work on sampling OSNs,
used random walks to sample OSNs and estimate nodal attributes
and local structural properties [5], [6], [10], [14]. However,
estimating global structural properties based on sampling re- 
mains a challenging problem. 

To characterize the global network structure, model fitting is
used. Handcock et al. used the ERGM framework for network
inference based on sampled data [7]. However, ERGM suffers
from degeneracy issues and does not scale to even moderate
graph sizes of thousands of nodes. Kim et al. use Kronecker
graphs to fit the network structure of the observed part of the
network and then estimate the missing nodes and edges [9]. Sala
et al. present an evaluation of model fitting for fully known
social graphs and conclude that the dK-series [11] is the best
model for such a task [15], surpassing even Kronecker graphs.

Graph Generation (a.k.a. Construction). Generating random
graphs that have some desired properties is an active research
area. The complexity of the algorithm and the ability to provide
guarantees depend on the desired properties. For example, 1K
can be generated using the configuration model [12]. 2K can be
generated using an extension of the configuration model [11].
Although the above approaches may result in multi-edges and
self-loops, they can be tweaked to avoid that in both the 1K [4] and
2K [19] cases. Unfortunately, these construction algorithms do
not generalize to dK, d > 2 and, to the best of our knowledge,
no efficient algorithm exists today for generating 3K graphs.

However, 1K or 2K are not sufficient to capture many crucial
graph properties, such as (higher than random) clustering,
which is inherent in virtually all real-life networks including
OSNs [15], [16]. For this reason, the following algorithms attempt
to construct random graphs with some notion of clustering. [13]
extends the configuration model to generate random graphs
with a given number of triangles. However, the resulting
triangles rarely share common edges, which results in small
values of $c_v$ and prevents us from targeting the average
clustering $\bar{c}$ and 1K at the same time. [3] targets 1K and
$\bar{c}$ using an MCMC approach.
[18] proposes a construction algorithm that targets c(k) while
preserving 1K; they have no control over assortativity, which
is actually determined by c(k) and 1K. In [11], the authors
target 3K by 2K-preserving
random rewiring. This approach is unfortunately not practical:
we contacted the authors of [11], who released only the code
for 2K construction, which, to the best of our knowledge, is
the most advanced application of dK-series that is achievable
in realistic time frames.

IV. Estimation from a sample 

As illustrated in Fig. 1, our first step is to sample the
underlying unknown graph, and to estimate the two properties
of 2.5K-graphs, namely JDD(k, l) and c(k), based on our
sample S. In this section, we derive such estimators for the most
common sampling methods: independence sampling (uniform
or weighted) and random walk. The former is possible if one
can sample directly from the userID space, whereas the latter
is the common practice in OSNs via crawling.

A. Uniform Independence Sampling (UIS) 

UIS samples the nodes directly from the set V, with replace- 
ment, uniformly and independently at random. 

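1) Estimation of c(k): Every triangle $\{a,b,c\}$ contributes
exactly count 1 to both $\mathrm{sp}(a,b)$ and $\mathrm{sp}(a,c)$. Therefore

$$T_a = \frac{1}{2} \sum_{b \in \mathcal{N}(a)} \mathrm{sp}(a,b) = \deg(a) \cdot \frac{\sum_{b \in \mathcal{N}(a)} \mathrm{sp}(a,b)}{2 \cdot \sum_{b \in \mathcal{N}(a)} 1}. \tag{5}$$

The latter, seemingly redundant transformation will help us
write the estimator. Indeed, we are unlikely to cover every node
$b \in \mathcal{N}(a)$ in our sample $S$. Instead, they are sampled with equal
probabilities and possibly $S.\mathrm{count}(b) > 1$ times. Exploiting the
sampled information, we can estimate $T_a$ by

$$\hat{T}_a = \deg(a) \cdot \frac{\sum_{b \in \mathcal{N}(a)} \mathrm{sp}(a,b) \cdot S.\mathrm{count}(b)}{2 \cdot \sum_{b \in \mathcal{N}(a)} S.\mathrm{count}(b)}.$$

Plugging it into Eq. (2), and taking the average across all nodes
$a$ of degree $\deg(a) = k$, we obtain

$$\hat{c}(k) = \frac{1}{k-1} \cdot \frac{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} \mathrm{sp}(a,b) \cdot S.\mathrm{count}(b)}{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} S.\mathrm{count}(b)}, \tag{6}$$

where $S_k \subset S$ are all the sampled nodes of degree $k$.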
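
The estimator of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the sample is a list of node IDs (with repetitions), the neighbor sets of sampled nodes are known, and the names `adj`, `sample` and `estimate_ck_uis` are hypothetical. Only neighbors that appear in the sample contribute, because their count is otherwise zero.

```python
from collections import Counter

def estimate_ck_uis(adj, sample, k):
    """Estimate c(k) from a uniform independence sample, following Eq. (6).

    Returns None when k < 2 or when no sampled node of degree k has a sampled
    neighbor (the estimator is then undefined).
    """
    count = Counter(sample)              # S.count(.)
    num = 0
    den = 0
    for a in sample:                     # a ranges over S_k, with multiplicity
        if len(adj[a]) != k:
            continue
        for b in adj[a]:
            if count[b] > 0:             # only induced edges are observed
                num += len(adj[a] & adj[b]) * count[b]   # sp(a, b) * S.count(b)
                den += count[b]
    if k < 2 or den == 0:
        return None
    return num / ((k - 1) * den)

# Toy graph (hypothetical): triangle 0-1-2 plus a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
full_sample = [0, 1, 2, 3]
```

When every node is sampled exactly once, the estimate coincides with the true c(k) of the toy graph, which makes a simple correctness check.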

2) Estimation of JDD: Let us rewrite Eq. (1) as

$$\mathrm{JDD}(k,l) = |V_k| \cdot |V_l| \cdot \frac{\sum_{a \in V_k} \sum_{b \in V_l} 1_{\{\{a,b\} \in E\}}}{|V_k| \cdot |V_l|}.$$

The fraction on the right-hand side divides the number of
existing edges between $V_k$ and $V_l$ by the maximal possible
number of such edges ($|V_k| \cdot |V_l|$). Under UIS, we observed
$\sum_{a \in S_k} \sum_{b \in S_l} 1_{\{\{a,b\} \in E\}}$ such edges out of the maximal
number $|S_k| \cdot |S_l|$ we could possibly observe, leading to the estimator

$$\widehat{\mathrm{JDD}}(k,l) = |\hat{V}_k|\,|\hat{V}_l| \cdot \frac{\sum_{a \in S_k} \sum_{b \in S_l} 1_{\{\{a,b\} \in E\}}}{|S_k| \cdot |S_l|} = |\hat{V}_k|\,|\hat{V}_l| \cdot \frac{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a) \cap S_l} S.\mathrm{count}(b)}{|S_k| \cdot |S_l|}. \tag{7}$$

The values $|V_k|$ and $|V_l|$ can be easily estimated as $|\hat{V}_k| = |S_k|/|S|$
and $|\hat{V}_l| = |S_l|/|S|$, respectively.

B. Weighted Independence Sampling (WIS) 

WIS samples the nodes directly from the set V, with replacement,
independently at random, but with probabilities proportional
to node weights w(v). For simplicity and compatibility
with random walks below, we are interested only in the case
where w(v) = deg(v). WIS produces biased (non-uniform)
node samples. However, because this bias is known, it can
be corrected by an appropriate re-weighting of the measured
values, e.g., using the Hansen-Hurwitz estimator [8], as follows.

1) Estimation of c(k): We apply the Hansen-Hurwitz estimator
to Eq. (6) by dividing every term related to some node
pair by the product of the weights of the involved nodes, i.e.,

$$\hat{c}(k) = \frac{1}{k-1} \cdot \frac{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} \frac{\mathrm{sp}(a,b)}{w(a)\,w(b)} \cdot S.\mathrm{count}(b)}{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} \frac{1}{w(a)\,w(b)} \cdot S.\mathrm{count}(b)} = \frac{1}{k-1} \cdot \frac{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} \frac{\mathrm{sp}(a,b)}{\deg(b)} \cdot S.\mathrm{count}(b)}{\sum_{a \in S_k} \sum_{b \in \mathcal{N}(a)} \frac{1}{\deg(b)} \cdot S.\mathrm{count}(b)}. \tag{8}$$

In the last step, we used $w(v) = \deg(v)$. The terms $\deg(a)$
cancelled out, because $\deg(a) = k$ for every $a \in S_k$.

2) Estimation of JDD: Assuming $w(v) = \deg(v)$, applying
the Hansen-Hurwitz estimator to Eq. (7) does not change it (now
both $\deg(a)$ and $\deg(b)$ cancel out). However, now both $|\hat{V}_k|$
and $|\hat{V}_l|$ must be corrected for the biases, as in [6]:

$$|\hat{V}_k| = \frac{\sum_{s \in S} \frac{1_{\{\deg(s) = k\}}}{\deg(s)}}{\sum_{s \in S} \frac{1}{\deg(s)}}, \qquad |\hat{V}_l| = \frac{\sum_{s \in S} \frac{1_{\{\deg(s) = l\}}}{\deg(s)}}{\sum_{s \in S} \frac{1}{\deg(s)}}. \tag{9}$$
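
The re-weighting in Eq. (9) can be illustrated with a short Python sketch. It takes only the degrees of the sampled nodes as input; the function name and the example sample are hypothetical. Note that the ratio in Eq. (9) is a relative size: the estimates sum to one over all observed degrees.

```python
from collections import defaultdict

def degree_fractions_wis(sampled_degrees):
    """Hansen-Hurwitz estimate of the share of nodes with each degree, Eq. (9).

    `sampled_degrees` lists deg(s) for every s in S, where nodes were sampled
    with probability proportional to their degree (degrees must be >= 1).
    """
    total = sum(1.0 / d for d in sampled_degrees)      # denominator of Eq. (9)
    weights = defaultdict(float)
    for d in sampled_degrees:
        weights[d] += 1.0 / d                           # numerator, per degree
    return {k: w / total for k, w in weights.items()}

# Hypothetical degree-biased sample: degree-4 nodes appear four times as often
# as degree-1 nodes, as expected when the three classes are equally large.
fractions = degree_fractions_wis([1, 2, 2, 4, 4, 4, 4])
```

In this example the raw sample is dominated by degree-4 nodes, but after re-weighting each of the three degrees receives the same estimated share.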

C. Simple Random Walk (RW) 

RW selects the next-hop node $v$ uniformly at random among
the neighbors of the current node $u$. In a connected and
aperiodic graph, the probability of being at a particular node $v$
converges to the stationary distribution $\pi^{RW}(v) = \frac{\deg(v)}{2|E|}$.

1) Estimation of c(k): We propose two techniques.

a) Induced Edges with safety margin M: Generally
speaking, in this approach we interpret nodes S collected by
RW as WIS. However, because consecutive RW samples are
correlated, a straightforward application of Eq. (8) introduces
a bias. Indeed, RW will observe many more induced edges,
defined as edges between any two nodes sampled by RW,
than WIS. For example, every step of RW is guaranteed to
result in at least one additional induced edge. Moreover, these
additional induced edges do not follow the same statistical
distributions and thus introduce arbitrary biases. For this reason,
we modify Eq. (8) by ignoring the sample pairs that are closer
than margin M in S:

$$\hat{c}(k) = \frac{1}{k-1} \cdot \frac{\sum_{i,j:\ \deg(s_i)=k,\ |j-i|>M,\ \{s_i,s_j\} \in E} \frac{\mathrm{sp}(s_i,s_j)}{\deg(s_j)}}{\sum_{i,j:\ \deg(s_i)=k,\ |j-i|>M,\ \{s_i,s_j\} \in E} \frac{1}{\deg(s_j)}}. \tag{10}$$

One can check that for M = 0, the above reduces to Eq. (8).
Increasing M makes Eq. (10) more robust to RW correlations,
at the cost of discarding information. For practical applications,
we recommend values 10 < M < 100.
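
A direct, unoptimized Python sketch of the Induced Edges estimator in Eq. (10) is given below. It assumes the walk is a list of visited node IDs in order and that neighbor sets of visited nodes are known; the names `adj`, `walk`, `margin` and the function name are hypothetical. The double loop is quadratic in the walk length and is meant only to mirror the formula.

```python
def estimate_ck_induced(adj, walk, k, margin):
    """Induced-edges estimate of c(k) with safety margin M, following Eq. (10).

    Pairs of samples (i, j) with |j - i| <= margin are ignored. Returns None
    when k < 2 or when no admissible induced edge is observed.
    """
    num = 0.0
    den = 0.0
    for i, a in enumerate(walk):
        if len(adj[a]) != k:
            continue
        for j, b in enumerate(walk):
            if abs(j - i) <= margin:
                continue                       # too close in the walk: skip
            if b in adj[a]:                    # {s_i, s_j} is an induced edge
                num += len(adj[a] & adj[b]) / len(adj[b])   # sp / deg(s_j)
                den += 1.0 / len(adj[b])
    if k < 2 or den == 0.0:
        return None
    return num / ((k - 1) * den)

# Toy graph (hypothetical): triangle 0-1-2 plus a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_walk = [0, 1, 2, 3]                        # a valid walk on the toy graph
```

On such a tiny example the weighting by 1/deg is clearly visible: the pendant neighbor of node 2 receives the largest weight, so the degree-3 estimate differs from the unweighted value.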

b) Traversed Edges: This technique is based on the observation
that the edges $S_E$ traversed by RW are asymptotically
uniform [14], which leads to

$$\hat{c}(k) = \frac{1}{k-1} \cdot \frac{\sum_{(u,v) \in S_E} \mathrm{sp}(u,v) \cdot \left(1_{\{\deg(u)=k\}} + 1_{\{\deg(v)=k\}}\right)}{\sum_{(u,v) \in S_E} \left(1_{\{\deg(u)=k\}} + 1_{\{\deg(v)=k\}}\right)}. \tag{11}$$
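
The Traversed Edges estimator of Eq. (11) can be sketched as follows, assuming the walk is given as the ordered list of visited nodes so that consecutive pairs are the traversed edges. The names `adj`, `walk` and `estimate_ck_traversed` are hypothetical.

```python
def estimate_ck_traversed(adj, walk, k):
    """Traversed-edges estimate of c(k), following Eq. (11).

    Returns None when k < 2 or when no traversed edge touches a degree-k node.
    """
    num = 0
    den = 0
    for u, v in zip(walk, walk[1:]):           # S_E: edges traversed by the walk
        hits = int(len(adj[u]) == k) + int(len(adj[v]) == k)
        num += len(adj[u] & adj[v]) * hits     # sp(u, v) times the indicators
        den += hits
    if k < 2 or den == 0:
        return None
    return num / ((k - 1) * den)

# Toy graph (hypothetical): triangle 0-1-2 plus a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
euler_walk = [3, 2, 0, 1, 2]                   # traverses every edge exactly once
```

Because the example walk covers each edge exactly once, the edge sample is exactly uniform and the estimate equals the true c(k) of the toy graph.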



2) Estimation of JDD: Similarly to the estimation of c(k),
we propose two techniques.

a) Induced Edges with safety margin M: The "safety
margin" M trick we used to correct the clustering coefficient
estimator can also be applied here. To this end, we modify Eq. (7)
by ignoring the sample pairs that are closer than margin M in
S, which results in

$$\widehat{\mathrm{JDD}}(k,l) = |\hat{V}_k|\,|\hat{V}_l| \cdot \frac{\sum_{i,j:\ |j-i|>M,\ \deg(s_i)=k,\ \deg(s_j)=l} 1_{\{\{s_i,s_j\} \in E\}}}{\sum_{i,j:\ |j-i|>M,\ \deg(s_i)=k,\ \deg(s_j)=l} 1}, \tag{12}$$

where $|\hat{V}_k|$ and $|\hat{V}_l|$ are calculated as in Eq. (9).

b) Traversed Edges: As before, an alternative approach
is to interpret the edges $S_E$ traversed by RW as asymptotically
uniform. Now, we just check what fraction of edges traversed
by RW are between nodes of degree $k$ and $l$, and then we inflate
this fraction by $|E|$, as follows:

$$\widehat{\mathrm{JDD}}(k,l) = |E| \cdot \frac{\sum_{(u,v) \in S_E} 1_{\{\deg(u)=k,\ \deg(v)=l\}}}{|S_E|}. \tag{13}$$



3) Hybrid Estimators: Under RW, we described two gen- 
eral estimation techniques, Traversed Edges (TE) and Induced 
Edges (IE). They lead to two different estimators. Which one 
should we choose? The answer depends on the graph size 
and structure, on what we are trying to estimate, and on the 
sample size. For example, TE visits one edge per iteration, so 
its collected and exploitable information grows linearly with the 
sample size n. In contrast, while for small n IE may include 
very few edges, it quickly catches up for larger n. So the first 
obvious hint is to compare the number of traversed and induced 
edges. Moreover, while TE samples edges uniformly, IE will be 
more likely to cover edges connecting nodes with high degree. 
Consequently, we expect TE to perform better when estimating 
values related to low degree nodes. 

In order to combine the advantages of both TE and IE, we 
use the estimate of TE for small degrees and the estimate of 
IE for large degrees. In order to switch between TE and IE, we 
compare the actual degree(s) to a threshold, which we choose 
to be the average node degree. These hybrid estimators are the 
ones we use in our approach: 

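$$\hat{c}^{\,\mathrm{hybrid}}(k) = \begin{cases} \hat{c}^{\,TE}(k) & \text{if } k < \bar{k} \\ \hat{c}^{\,IE}(k) & \text{otherwise,} \end{cases} \tag{14}$$

$$\widehat{\mathrm{JDD}}^{\mathrm{hybrid}}(k,l) = \begin{cases} \widehat{\mathrm{JDD}}^{TE}(k,l) & \text{if } k + l < 2\bar{k} \\ \widehat{\mathrm{JDD}}^{IE}(k,l) & \text{otherwise,} \end{cases} \tag{15}$$

where $\bar{k}$ is the average node degree.
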
D. Postprocessing of Estimated Parameters 

1) Smoothing: The estimation of the joint degree distribution
from node samples produces a considerable number of
high frequency elements in the JDD 2-dimensional matrix.
For example, degree pair entries that involve one low degree
node are overestimated since low degree nodes have lower
visiting probability in random walks. For that reason, we apply
Gaussian kernel smoothing to the measured matrix to reduce
the amplitude of such discrete elements. We select the kernel
bandwidth using Scott's rule of thumb [17].

2) Realizable JDD: There is no guarantee that there exists
a simple graph that has the joint degree distribution estimated
in Section IV. According to [19], the necessary and sufficient
conditions that make a joint degree distribution realizable are
the following: (i) $\mathrm{JDD}(k,l) \in \mathbb{Z}$; (ii) $\deg(k) \in \mathbb{Z}$; (iii)
$\mathrm{JDD}(k,l) \le \deg(k) \cdot \deg(l)$, $k \ne l$; (iv) $\mathrm{JDD}(k,k) \le \binom{\deg(k)}{2}$, $k = l$;
(v) $\mathrm{JDD}(k,k) = 2 \cdot e_k$, $e_k \in \mathbb{Z}$, where
$\deg(k)$ represents the number of nodes of degree $k$.

Conditions (i) and (ii) state that the degree and joint degree
distributions must be integer. To that end, we stochastically
round the estimated values of the JDD matrix. Conditions
(iii) and (iv) state that the number of edges between nodes of
degree k and l cannot exceed the maximum possible number
of edges, given the number of nodes deg(k) and deg(l). Last, (v)
states that the entry JDD(k, k), which by Eq. (1) counts every edge
between two nodes of the same degree twice, has to be even.
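
A minimal Python sketch of a checker for conditions (i)-(v) is shown below. It is an illustration under several assumptions that are ours: the JDD is a dictionary keyed by (k, l) with the symmetric, double-counting convention of Eq. (1); degrees are at least 1; the number of nodes of degree k is recovered as the row sum divided by k; and condition (iv) is applied to the number of same-degree edges $e_k = \mathrm{JDD}(k,k)/2$. The function name is hypothetical.

```python
from math import comb

def is_realizable(jdd):
    """Check conditions (i)-(v) on a JDD given as {(k, l): value}."""
    # (i) every entry is a non-negative integer
    if any(m < 0 or m != int(m) for m in jdd.values()):
        return False
    # Row sums count edge endpoints at nodes of degree k.
    stubs = {}
    for (k, l), m in jdd.items():
        stubs[k] = stubs.get(k, 0) + int(m)
    # (ii) the number of nodes of degree k, stubs[k] / k, is an integer
    if any(s % k != 0 for k, s in stubs.items()):
        return False
    nodes = {k: s // k for k, s in stubs.items()}
    for (k, l), m in jdd.items():
        if jdd.get((l, k), 0) != m:            # symmetry implied by Eq. (1)
            return False
        if k != l and m > nodes[k] * nodes.get(l, 0):          # (iii)
            return False
        if k == l:
            if m % 2 != 0:                                     # (v)
                return False
            if m // 2 > comb(nodes[k], 2):                     # (iv) on e_k
                return False
    return True

# JDD of a triangle 0-1-2 with a pendant node attached to node 2 (hypothetical).
toy_jdd = {(2, 2): 2, (2, 3): 2, (3, 2): 2, (3, 1): 1, (1, 3): 1}
```

For instance, a row sum of 46 for degree 10 fails condition (ii), since 46 edge endpoints cannot be split among nodes of degree 10.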

It is imperative that the estimated JDD matrix fed into the 
2.5K generator is realizable. Otherwise, the constructed graph 
might contain nodes with degrees that were not sampled from 
the graph. That creates a problem during the phase in which 
we target the measured c(k), since the latter will only contain 
degrees that were sampled from the graph. For example, assume
that our estimators yield $\sum_k \mathrm{JDD}(10, k) = 46$, i.e., there are
46 edges connected to nodes of degree 10. If we construct a
2K graph using this JDD matrix, we will get 4 nodes of degree
10 and 1 node of degree 6. There is a problem if the estimators
also yield $\sum_k \mathrm{JDD}(6, k) = 0$, which means that there are no
nodes of degree 6.

To address this problem, we designed an algorithm that 
slightly modifies the JDD matrix to make it realizable. We for- 
mulated the problem as an optimization problem (to minimize 
the error between the estimated and the modified JDD subject 
to realizability constraints) and we also developed an efficient 
algorithm that provably achieves a realizable JDD. In the above 
example, our algorithm stochastically chooses whether to keep 
5 or 4 nodes of degree 10 and then adds accordingly 4 or 
removes 6 edges in the JDD matrix, while satisfying the above 
conditions. Due to lack of space we omit the details and defer 
the full description to the source code at [1].

In summary, at the end of this step, we have slightly modified 
the estimated JDD to make it realizable and ready to be 
fed to the generation algorithm in the next step. We verified 
that the changes were indeed minimal: in all simulations, 
postprocessing changed no more than 3% of the edges. It 
is worth noting that postprocessing is not necessary for the 
estimated c(k), since the construction algorithm achieves JDD
exactly but clustering only approximately. 

V. Generating a 2.5K graph
In this section, we design an algorithm that takes as input
the two target properties estimated as in the previous section,
i.e., the target joint node degree distribution, $\mathrm{JDD}^{\odot}(k,l)$, and
the target degree-dependent average clustering, $c^{\odot}(k)$, and
constructs a 2.5K-graph with N nodes and the target properties.

A. Unsuccessful attempts and lessons learned 

1) MCMC: The authors of [11] apply an extension
of the configuration model [12] to generate a graph
that exactly satisfies $\mathrm{JDD}^{\odot}(k,l)$. Starting from such
a graph, one can perform 2K-preserving double-edge
swaps to target the clustering $c^{\odot}(k)$, as follows:

MCMC
do
    randomly select edges (u, v) and (x, y) such that $k_u = k_x$
    rewire these edges into (u, y) and (x, v)
    if $\sum_k |c^{\odot}(k) - c(k)|$ has increased then
        undo the rewiring


Although [11] proposed this method to target the entire
3K, we found it impractical already for its relaxed version
c(k): very soon after creating the first triangles, the
probability that an edge swap brings us closer to the target
becomes very small. Consequently, this MCMC approach does not
converge within a practical time frame.
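
The swap loop above can be sketched in Python as follows. This is a naive illustration, not the released implementation: it recomputes c(k) from scratch after every swap, which is far too slow for real graphs, and it additionally rejects swaps that would create self-loops or multi-edges. The function names and the adjacency representation (`adj`, node to neighbor set) are assumptions of this example.

```python
import random
from collections import defaultdict

def clustering_by_degree(adj):
    """c(k): mean of c_v over nodes of degree k (c_v taken as 0 if deg < 2)."""
    values = defaultdict(list)
    for v, nbrs in adj.items():
        k = len(nbrs)
        links = sum(len(nbrs & adj[b]) for b in nbrs)      # equals 2 * T_v
        values[k].append(links / (k * (k - 1)) if k > 1 else 0.0)
    return {k: sum(c) / len(c) for k, c in values.items()}

def distance(adj, target_ck):
    """Objective: sum over k of |target c(k) - current c(k)|."""
    current = clustering_by_degree(adj)
    degrees = set(current) | set(target_ck)
    return sum(abs(target_ck.get(k, 0.0) - current.get(k, 0.0)) for k in degrees)

def swap(adj, u, v, x, y):
    """Rewire edges (u, v) and (x, y) into (u, y) and (x, v)."""
    adj[u].remove(v); adj[v].remove(u)
    adj[x].remove(y); adj[y].remove(x)
    adj[u].add(y); adj[y].add(u)
    adj[x].add(v); adj[v].add(x)

def mcmc_step(adj, target_ck, rng):
    """Try one 2K-preserving double-edge swap; return True if it was kept."""
    directed = [(a, b) for a in adj for b in adj[a]]       # both orientations
    (u, v), (x, y) = rng.sample(directed, 2)
    if len(adj[u]) != len(adj[x]):                         # need k_u = k_x
        return False
    if len({u, v, x, y}) < 4 or y in adj[u] or v in adj[x]:
        return False                                       # keep the graph simple
    before = distance(adj, target_ck)
    swap(adj, u, v, x, y)
    if distance(adj, target_ck) > before:
        swap(adj, u, y, x, v)                              # undo the rewiring
        return False
    return True
```

A typical use is to call `mcmc_step(adj, target_ck, random.Random(seed))` in a loop until the distance stops improving; node degrees and the JDD are unchanged by every accepted swap.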

2) Improved MCMC: We tried to address this problem by 
selecting the two candidates for a swap in a smarter way, e.g., 
by favoring edges with fewer triangles attached. The rationale 
was that after deleting these edges few triangles are destroyed.
Although this improved over the naive MCMC, we still
faced scalability problems.

B. Our 2.5K generator 

Our key insight is that, for the same reason why it is difficult
to create triangles with double-edge swaps, it is easy to destroy
the existing triangles. This suggests starting the MCMC from a
triangle-rich 2K graph rather than from a regular, triangle-poor
2K graph, such as that used in [11]. If the starting graph overshoots
the target $c^{\odot}(k)$, the job of the MCMC phase is to destroy
triangles rather than to create new ones, which is much faster.

Step 1. In order to create a triangle-rich graph with a
given $\mathrm{JDD}^{\odot}(k,l)$, we initially follow the two initial steps
of [12] and [11]: we create a set of nodes V, where
|V| = N, and we assign a target degree $k_v^{\odot}$ to every node
$v \in V$ such that the target 1K distribution (fully defined by
$\mathrm{JDD}^{\odot}(k,l)$) is satisfied. Next, we apply the following algorithm.



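Step 2. Greedily create local edges:

require $\mathrm{JDD}^{\odot}(k,l)$
for $v \in V$ do $r_v$ = rand(0, 1)
dist(u, v) = min($|r_v - r_u|$, $1 - |r_v - r_u|$)
$E' \leftarrow$ a list of all possible node pairs {u, v}
sort $E'$ according to dist(u, v)
$E = \emptyset$
forall $\{u, v\} \in E'$ do
    if $\mathrm{JDD}(k_u^{\odot}, k_v^{\odot}) < \mathrm{JDD}^{\odot}(k_u^{\odot}, k_v^{\odot})$ and $k_u < k_u^{\odot}$ and $k_v < k_v^{\odot}$ do
        $E \leftarrow E \cup \{u, v\}$
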



The above algorithm first assigns to every node v a coordinate
$r_v$ randomly selected from the interval (0, 1). Then, it creates a set
E' of all possible node pairs sorted by increasing distance in
this one-dimensional coordinate system. Finally, it goes through
all pairs in E' and creates an edge if the target values $\mathrm{JDD}^{\odot}$ and
$k^{\odot}$ are not exceeded. This construction ensures that the created
edges tend to be local (i.e., with small dist(u, v)), which
leads to many triangles. (Notice that, throughout the algorithm
execution, the target degree $k_v^{\odot}$ and joint degree $\mathrm{JDD}^{\odot}(k,l)$
remain unchanged, while the current node degree $k_v$ and joint
node degree JDD(k, l) may change with every added/modified
edge. In the beginning, we have $k_v = 0$ and JDD(k, l) = 0,
because there are no edges in the graph yet.) If we reach the
target values $\mathrm{JDD}^{\odot}$ and $k^{\odot}$ at the end of this step, we are done.
If the target values are not reached, we are in the situation
depicted in Fig. 2(a) and we need to add some more edges
as follows.

Step 3. We iteratively apply the transformations described in
Fig. 2 until our graph satisfies precisely the target $\mathrm{JDD}^{\odot}(k,l)$
(thus $k^{\odot}$ as well). In our simulations, we saw that the number
of triangles created by the end of this step dramatically exceeds
the number needed to achieve $c^{\odot}(k)$, as shown in Fig. 5(b).

Step 4. Finally, we apply the 2K-preserving, $c^{\odot}(k)$-targeting
double-edge swaps MCMC, described in Sec. V-A1.

We release an implementation of our 2.5K generator at [1].
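
Step 2 of the generator (greedy creation of local edges) can be sketched in Python as follows. This is an illustrative sketch only, not the released code: it enumerates all node pairs, so it needs quadratic time and memory, and it tracks the current JDD by target-degree classes with the symmetric, double-counting convention of Eq. (1). The names `target_degree`, `target_jdd` and `greedy_local_edges` are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def greedy_local_edges(target_degree, target_jdd, rng):
    """Greedily add edges between nearby nodes on a ring of random coordinates.

    target_degree: node -> target degree; target_jdd: {(k, l): target value}.
    Returns the set of created edges and the JDD reached so far.
    """
    r = {v: rng.random() for v in target_degree}           # coordinate r_v

    def dist(u, v):
        d = abs(r[u] - r[v])
        return min(d, 1.0 - d)                              # distance on a ring

    pairs = sorted(combinations(target_degree, 2), key=lambda p: dist(*p))
    degree = Counter()                                      # current degrees
    jdd = Counter()                                         # current JDD
    edges = set()
    for u, v in pairs:
        k, l = target_degree[u], target_degree[v]
        if (jdd[(k, l)] < target_jdd.get((k, l), 0)
                and degree[u] < k and degree[v] < l):
            edges.add((u, v))
            degree[u] += 1
            degree[v] += 1
            jdd[(k, l)] += 1
            jdd[(l, k)] += 1          # adds 2 in total when k == l, as in Eq. (1)
    return edges, jdd
```

In general this step can stop short of the target JDD, which is the situation that Step 3 then resolves; on very small inputs such as a triangle with one pendant node it already reaches the target.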

C. Guarantees 

Our algorithm is a heuristic, in the sense that it does not 
currently come with provable guarantees. However, it is worth 
noting that in all our simulations we were able to construct
graph instances that always had the exact target JDD and
approximately the target clustering (closer to the target and
orders of magnitude faster than prior approaches). Theoretical
guarantees for the achieved properties and a characterization of 
the variability of the constructed graphs are possible directions 
for future work. 

VI. Performance Evaluation 
In this section, we evaluate the performance of our approach. 
First, we evaluate the efficiency of the estimators of the 2.5K 
parameters. Then, we show that our 2.5K generator is orders 
of magnitude faster than state-of-the-art approaches. We also 
show that the generated graphs are very close to the original 
ones with regard to a number of graph properties (beyond JDD
and clustering, which are met by construction). 

A. Simulation Setup 



Dataset              $|V|$     $|E|$      $\bar{k}_V$   $\sum_{v \in V} T_v$   $\bar{c}$
FB: UCSD [21]        14 948    443 221    59.30         7 995 471              0.227
FB: Harvard [21]     15 126    824 617    109.03        24 848 793             0.212
FB: New Orl. [22]    63 392    816 884    25.77         10 504 548             0.222
soc-Epinions [2]     75 877    405 737    10.69         4 873 260              0.138
email-Enron [2]      36 692    183 831    10.02         2 181 132              0.497
CAIDA AS [2]         26 475    53 377     4.03          109 086                0.208

TABLE I: EMPIRICAL TOPOLOGIES USED IN SEC. VI



Data Sets. Table I lists the real topologies that we use in
our evaluation. The list includes online social networks, email 
communication graphs and autonomous systems graphs. The 
average degree varies from 4 to 109 and clustering varies from 
0.14 to 0.50. We treat all topologies as undirected graphs. 

Comparison of Graph Properties. We measure the difference 
between two discrete distributions using the Normalized Mean 
Absolute Error (NMAE), defined as: 

\[ \mathrm{NMAE}(\hat{x}, x) = \frac{\sum_i |\hat{x}_i - x_i|}{\sum_i |x_i|}, \]

where $x$ and $\hat{x}$ are the vectors that correspond to the real and 
estimated discrete distributions. NMAE returns the percentage 
of error, averaged over every point in the discrete distribution. 
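
A minimal sketch of an NMAE computation follows. It assumes the common definition in which the summed absolute error is divided by the summed magnitude of the real vector; whether this matches the exact normalization used in the equation above is an assumption, and the function name `nmae` is illustrative.

```python
def nmae(estimated, real):
    """Normalized mean absolute error between two equal-length vectors.

    Assumed form: sum_i |estimated_i - real_i| / sum_i |real_i|.
    """
    if len(estimated) != len(real):
        raise ValueError("vectors must have the same length")
    total = sum(abs(r) for r in real)
    if total == 0:
        raise ValueError("real vector must not be all zeros")
    return sum(abs(e - r) for e, r in zip(estimated, real)) / total


# Toy distributions: the absolute errors are 0.1, 0.05 and 0.05.
real = [0.5, 0.3, 0.2]
estimated = [0.4, 0.35, 0.25]
error = nmae(estimated, real)
print(f"NMAE = {error:.2f}")
```

Under this definition, an NMAE of 0.25 means that the total absolute deviation equals a quarter of the total mass of the real distribution.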
Fig. 2. Illustration of Step 3: throwing the remaining edges. Legend: nodes 
with unreached target degree, i.e., with k_v < k_v^⊙ (here k_v = k_v^⊙ - 1), and 
nodes with k_v = k_v^⊙. Several cases: 
A) Nodes with target degree k^⊙ (top) and l^⊙ (bottom). Assume that currently 
JDD(k, l) = JDD^⊙(k, l) - 1, so we still have one edge left to throw. The only 
two nodes with unreached target degree are a and b, but an edge (a, b) already 
exists. B) In that case, we find nodes c and d such that: (i) k_c^⊙ = k_b^⊙, 
(ii) (a, c) does not exist, (iii) (c, d) exists, and (iv) (d, b) does not exist. 
C) Create (a, c), and change (c, d) into (b, d). As a result, JDD(k, l) = 
JDD^⊙(k, l), JDD() of all other degree pairs remains untouched, k_c and k_d 
remain the same, and k_a = k_a^⊙ and k_b = k_b^⊙. D,E) If (b, d) exists for every 
d (rarely happens in practice), follow B) and C) without creating (b, d). Now, 
the problem is moved to another pair of degrees. F) Finally, it is possible 
that edge (a, c) exists for all candidates c. G,H) In that case, add an edge 
between two nodes that reached the target degree (here c and e) and delete one 
edge of c and one of e. As a result, we have two pairs of nodes to deal with. 
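
Cases B and C of the caption can be sketched as a small rewiring routine. This is an illustration only, not the released implementation: the adjacency-set representation, the function name `rewire_case_b`, and the deterministic candidate order are assumptions, and condition (i) is taken to mean that c has the same target degree as b.

```python
def rewire_case_b(adj, target_degree, a, b):
    """Place the missing edge for a and b when (a, b) already exists.

    Finds c, d with: (i) target_degree[c] == target_degree[b],
    (ii) (a, c) absent, (iii) (c, d) present, (iv) (d, b) absent.
    Then creates (a, c) and changes (c, d) into (b, d).
    Returns (c, d), or None if no such pair exists (cases D to H).
    """
    for c in sorted(adj):
        if c in (a, b) or target_degree[c] != target_degree[b] or c in adj[a]:
            continue
        for d in sorted(adj[c]):
            if d == b or d in adj[b]:
                continue
            adj[a].add(c)
            adj[c].add(a)        # create (a, c)
            adj[c].remove(d)
            adj[d].remove(c)     # drop (c, d) ...
            adj[b].add(d)
            adj[d].add(b)        # ... and re-attach it as (b, d)
            return c, d
    return None


# Toy instance: a and b each miss one edge, but (a, b) already exists.
adj = {"a": {"b"}, "b": {"a"}, "c": {"d", "e"}, "d": {"c"}, "e": {"c"}}
target = {"a": 2, "b": 2, "c": 2, "d": 1, "e": 1}
chosen = rewire_case_b(adj, target, "a", "b")
degrees = {v: len(nbrs) for v, nbrs in adj.items()}
print(chosen, degrees)
```

In the toy instance, a and b reach their target degree while the degrees of c and d are unchanged, which mirrors the outcome described for case C.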

B. 2.5K Estimation and Postprocessing 

Estimation. In this part, we test our estimators of JDD(k, l) 
and c(k), developed in Sec. IV. Previously, we introduced two 
techniques to estimate JDD(k, l) and c(k). We argued that 
Traversed Edges is better for very small degrees. Fig. 3(a) 
demonstrates this point in the Facebook New Orleans network. 
For a sample length of 3%, Traversed Edges better estimates the 
clustering coefficient for degrees k < 30. Fig. 3(b,c) show that 
the Hybrid estimator, defined in Eq. (14), outperforms the two 
base estimators for sample lengths of 1%-40% in the estimation of 
c(k) and JDD(k, l). In the rest of the experiments, we always 
use the Hybrid estimator. 

Postprocessing. After the estimation of JDD(k, l), we 
smooth the high frequency elements of the matrix and ensure 
realizability during construction. Fig. 4 shows the effect of 
smoothing on the Facebook New Orleans network. Fig. 4(b) 
and Fig. 4(c) are the non-smoothed and smoothed versions 
of a random walk sample of 20% length. The smoothed version 
has considerably smaller error (NMAE 0.25 vs. 0.52) and its 
highest frequency element is closer to the full graph (480 vs. 
960). Fig. 4 also provides a visual validation of the estimation 
result. Last, the modification of the JDD matrix to make it 
realizable results in a small number of edge changes in the 
matrix, typically between 1% and 5%. Due to lack of space, we 
omit additional results. 

Fig. 3. Facebook New Orleans: estimation of clustering c(k) and JDD(k, l) with smoothing. Panels: (a) c(k) for samples of length 3%; (b) c(k) for sample lengths 1%-40%; (c) JDD(k, l) for sample lengths 1%-40%. Curves: Full Graph, Traversed Edges, Hybrid, Ind. Edges (M = 50). The results are aggregated over 100 samples. 

Fig. 4. Facebook New Orleans: estimation of the joint degree distribution JDD(k, l). The effect of smoothing. Panels: (a) full graph; (b) 20% sample without smoothing (NMAE: 0.52); (c) 20% sample with smoothing (NMAE: 0.25). 

Dataset             2K-T + Imp. MCMC   2K-T + MCMC   2K + Imp. MCMC   2K + MCMC
FB: UCSD [21]       568                1 742         24 800           177 533
FB: Harvard [21]    1 182              2 880         50 516           387 506
FB: New Orl. [22]   1 463              5 450         118 711          381 397
soc-Epinions [2]    888                1 080         3 342            8 958
email-Enron [2]     4 279              14 393        66 766           196 202
CAIDA AS [2]        121                141           131              168

TABLE II: GRAPH GENERATION TIME IN SECONDS. 

C. 2.5K Graph Generation 

1) Speed of Generation: To better understand the gains, we 
evaluate separately two parts of the 2.5K generator: the first 
part constructs a graph with an exact JDD and the second part 
approximately achieves c(k). In the first part, we compare: (i) 
a baseline, the algorithm from [11], simply referred to as 2K; 
and (ii) our algorithm (steps 1, 2, 3 in Section V.B), which we 
call 2K-T because it constructs an exact JDD but with a large 
number of triangles. In the second part, we compare the two 
options mentioned in Section V.A: (i) MCMC and (ii) Improved 
MCMC. We ran simulations for all four possible combinations 
of the two parts to achieve 2.5K on the datasets of Table I. 
Simulations were performed on an AMD Opteron machine 
clocked at 3.2 GHz. We set the stopping condition for the 
second part to NMAE < 2%. 

We present the simulation results in Table II. The best 
performing combination, and thus our proposed method, is 
2K-T+Improved MCMC. It achieves up to 300 times better 
performance than 2K+MCMC. The speedup we obtain can 
be decomposed in two parts: 2-6 times because of the improved 
MCMC and up to 50 times because of the 2K-T construction. 

Fig. 5. Facebook New Orleans: speed of 2.5K generation. Panels: (a) NMAE and average clustering coefficient over time (sec); (b) degree-dependent average clustering coefficient. Curves: 2K-T + Imp. MCMC, 2K + Imp. MCMC, 2K-T + MCMC, 2K + MCMC, Real Graph. 
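
The overall speedup of the proposed combination over the baseline can be recomputed directly from the generation times in Table II. The sketch below does this per dataset; the dictionary layout and the helper name `overall_speedup` are illustrative.

```python
# Generation times in seconds, copied from Table II.
# Column order: 2K-T+Imp. MCMC, 2K-T+MCMC, 2K+Imp. MCMC, 2K+MCMC.
TABLE_II = {
    "FB: UCSD": (568, 1742, 24800, 177533),
    "FB: Harvard": (1182, 2880, 50516, 387506),
    "FB: New Orl.": (1463, 5450, 118711, 381397),
    "soc-Epinions": (888, 1080, 3342, 8958),
    "email-Enron": (4279, 14393, 66766, 196202),
    "CAIDA AS": (121, 141, 131, 168),
}


def overall_speedup(row):
    """Ratio of the 2K+MCMC time to the 2K-T+Improved MCMC time."""
    proposed, _, _, baseline = row
    return baseline / proposed


speedup = {name: overall_speedup(row) for name, row in TABLE_II.items()}
for name, ratio in sorted(speedup.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {ratio:7.1f}x")
```

The largest ratios, on the Facebook university networks, are on the order of 300, while the ratio for CAIDA AS stays close to 1.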

We further demonstrate this speedup in a simulation of the 
New Orleans Facebook network. Fig. 5(a) shows the NMAE 
and the average clustering as a function of simulation time. The 2K 
and 2K-T construction times are approximately 40 and 400 sec, respectively. 
Despite this head start in the first part, the 2K versions take 
between 118K and 381K sec, whereas the 2K-T versions take 
between 1K and 5K sec to target c(k), depending on the version 
of MCMC used. This huge difference in running time is due to 
the large number of triangles created by our 2K-T construction, 
as shown in Fig. 5(b). This confirms our intuition that, when 
targeting c(k), the task of destroying triangles is much easier 
than creating new ones. 

Table III layout. Rows (per dataset): 2K, 2.25K, 2.5K, samp.+2K, and samp.+2.5K (two sample lengths). Columns: Normalized Mean Absolute Error in comparison with the real graph for DD, Knn, JDD, CC, ESP, Sh.P., Cliq., Cycl., and Spect. 

TABLE III: RESULTS AVERAGED OVER 5 RUNS. DD: Degree 
Distribution, Knn: Average Neighbor Degree Distribution, JDD: 
Joint Degree Distribution, CC: Degree-Dependent Average 
Clustering, ESP: Edgewise Shared Partners, Sh.P.: Shortest 
Paths Distribution, Cliq.: Maximal Cliques Distribution, Cycl.: 
Cycle Basis Size Distribution, Spect.: 20 Largest Eigenvalues. 

We observed that our 2.5K generator (2K-T+Improved 
MCMC) yields larger performance gains in graphs that have 
a high number of triangles per node on average. Social graphs 
and other human communication graphs fall into that category. 
In contrast, the CAIDA AS dataset is an autonomous systems graph 
that has a low number of triangles per node, even though it has a 
relatively high global clustering coefficient c. In this dataset, 
we observe from Table II that our 2.5K generator performs 
similarly to 2K+MCMC in terms of construction time. 
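
The contrast in triangles per node can be made concrete from Table I. The sketch below assumes that the triangle column of Table I is the per-node triangle count summed over all nodes, so dividing it by |V| gives the average number of triangles per node; the variable names are illustrative.

```python
# (|V|, triangle count summed over all nodes), copied from Table I.
TABLE_I = {
    "FB: UCSD": (14948, 7995471),
    "FB: Harvard": (15126, 24848793),
    "FB: New Orl.": (63392, 10504548),
    "soc-Epinions": (75877, 4873260),
    "email-Enron": (36692, 2181132),
    "CAIDA AS": (26475, 109086),
}

# Average number of triangles that a node participates in.
triangles_per_node = {
    name: triangle_sum / num_nodes
    for name, (num_nodes, triangle_sum) in TABLE_I.items()
}
for name, value in sorted(triangles_per_node.items(), key=lambda item: item[1]):
    print(f"{name:14s} {value:8.1f}")
```

Under this reading, CAIDA AS averages only a few triangles per node, while the Facebook graphs average hundreds or more.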

2) Matching Graph Properties: We now compare how 
closely the generated graphs resemble the original ones, w.r.t. 
a variety of graph properties. First, we present results when the 
original graph is fully known, which provides a ground truth 
and allows us to evaluate how close the generated graph is to the 
original. Then, we also apply our methodology on unknown 
graphs, which we expect to be the main use in practice. 



a) Graph Properties used for Comparison: We consider 
a range of graph properties, beyond just JDD and clustering, 
which are met by construction. 

(i) The degree distribution (DD) and degree-dependent average 
neighbor degree (Knn) are fully determined by JDD(k, l). We 
include them mainly for the case in which the original graph 
is unknown and a sample is collected. 

(ii) Edgewise shared partner distribution (ESP) is the proportion 
of edges that have k common neighbors. 

(iii) Shortest path distribution (Sh.P.) is defined as the probability 
that a random pair of nodes is at a shortest path distance 
of h hops from each other. 

(iv) Maximal clique distribution (Cliq.) is defined as the frequency 
of maximal cliques. 

(v) Cycles distribution (Cycl.) is the frequency of cycle lengths 
for a minimal cycle basis, in which a cycle cannot be reconstructed 
by the union of cycles in the basis. 

(vi) Spectrum (Spect.): the eigenvalues of a graph are related to 
various graph properties such as expansion and clusterability. 

(vii) Closeness centrality of a node is defined as the inverse of 
the sum of distances of the node to all other nodes. It captures 
how quickly information spreads from this node to all other nodes. 
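
Three of the properties listed above, (ii), (iii), and (vii), can be computed with a few lines of plain Python on a small graph. This is a minimal sketch under stated assumptions: the graph is an adjacency-set dictionary with comparable node labels, only reachable pairs are counted in the shortest path distribution, and all function names are illustrative.

```python
from collections import Counter, deque


def esp_distribution(adj):
    """Proportion of edges whose endpoints share k common neighbors."""
    counts = Counter()
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                counts[len(adj[u] & adj[v])] += 1
    num_edges = sum(counts.values())
    return {k: c / num_edges for k, c in sorted(counts.items())}


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def shortest_path_distribution(adj):
    """Fraction of reachable node pairs at shortest path distance h."""
    counts = Counter()
    for s in adj:
        for t, h in bfs_distances(adj, s).items():
            if s < t:  # count each unordered pair once
                counts[h] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in sorted(counts.items())}


def closeness(adj, node):
    """Inverse of the sum of distances from node to all reachable nodes."""
    total = sum(bfs_distances(adj, node).values())
    return 1 / total if total else 0.0


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
esp = esp_distribution(graph)
paths = shortest_path_distribution(graph)
print(esp, paths, closeness(graph, 2), closeness(graph, 3))
```

In the toy graph, the three triangle edges each have one shared partner and the pendant edge has none, and the hub node 2 has higher closeness than the pendant node 3.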

b) Results for a Fully Known Original Graph: There are 
two natural questions regarding our approach: 

• 2K vs. 2.5K: how much does it help to target clustering 
in a graph, after having achieved JDD exactly? 

• 2.25K vs. 2.5K: do we need to target the whole degree-dependent 
average clustering c(k) (2.5K), or can we get most 
of the benefits by targeting the global average clustering 
coefficient c (2.25K)? 

We answer these questions by performing experiments assuming 
that the graph is fully known. This allows us to examine 
the potential of our 2.5K generator without any estimation errors. 
For each known dataset we extract the exact JDD(k, l), 
c, c(k) and generate the 2K, 2.25K and 2.5K graphs. 
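
For a fully known graph, the three target quantities can be extracted with a direct pass over the adjacency structure. The sketch below is an illustration, not the released implementation. It assumes that JDD(k, l) counts edges between a degree-k and a degree-l node (stored with k <= l), that nodes of degree below 2 have clustering 0, and that c is the mean local clustering over all nodes; these conventions and the function names are assumptions.

```python
from collections import Counter, defaultdict


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # assumed convention for degree-0 and degree-1 nodes
    nbrs = sorted(adj[v])
    links = sum(1 for i, u in enumerate(nbrs) for w in nbrs[i + 1:] if w in adj[u])
    return 2 * links / (k * (k - 1))


def joint_degree_counts(adj):
    """JDD(k, l): number of edges joining a degree-k and a degree-l node."""
    counts = Counter()
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                k, l = sorted((len(adj[u]), len(adj[v])))
                counts[(k, l)] += 1
    return counts


def degree_dependent_clustering(adj):
    """c(k): mean local clustering over the nodes of degree k."""
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(local_clustering(adj, v))
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_degree.items())}


def average_clustering(adj):
    """c: mean local clustering over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
jdd = joint_degree_counts(graph)
c_k = degree_dependent_clustering(graph)
c_bar = average_clustering(graph)
print(dict(jdd), c_k, round(c_bar, 4))
```

The 2K target corresponds to `jdd` alone, 2.25K adds the scalar `c_bar`, and 2.5K adds the full vector `c_k`.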

We present the simulation results in Table III. The results 
indicate that targeting c(k), on top of JDD, reduces the error 
on all considered graph properties, with the exception of the 
clique distribution. The graph properties that benefit the most 
are Spectrum, Shortest Path distribution, and Edgewise Shared 
Partners. Additionally, we deduce that 2.25K gets as close as 
2.5K to the Shortest Path distribution. However, on all other 
graph properties 2.5K is noticeably better than 
2.25K in terms of NMAE. 

c) Results for Unknown Graph: Finally, in this section we 
put all the pieces together. Our general workflow is the one 
shown in Fig. 1: (i) sample the original graph G; (ii) estimate 
JDD and c(k); (iii) post-process JDD; (iv) apply our 2.5K 
generator to create a new graph G'; and (v) compare G and G' 
with respect to many metrics. 
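
Step (i) of the workflow can be illustrated with a simple random walk sampler. This is a minimal sketch under stated assumptions: the sample length is taken as a fraction of |V|, the walk starts at a uniformly chosen node, every node has at least one neighbor, and the function name and `seed` parameter are illustrative rather than part of the released implementation.

```python
import random


def random_walk_sample(adj, fraction, seed=0):
    """Return the node sequence visited by a simple random walk.

    The walk collects int(fraction * |V|) samples (at least one);
    nodes may repeat, as is usual for random walk sampling.
    """
    rng = random.Random(seed)
    length = max(1, int(fraction * len(adj)))
    current = rng.choice(sorted(adj))
    walk = [current]
    while len(walk) < length:
        # Move to a uniformly chosen neighbor of the current node.
        current = rng.choice(sorted(adj[current]))
        walk.append(current)
    return walk


# Toy graph: a ring of 20 nodes, sampled at 50% length.
n = 20
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
walk = random_walk_sample(ring, 0.5, seed=1)
print(walk)
```

The estimators of step (ii) would then operate on such a walk, together with the neighbor lists of the visited nodes.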

Table III presents results for random walk samples of 10% 
and 20% length for the datasets New Orleans, Epinions, and 
Enron, and samples of 20% and 30% length for the smaller datasets 
UCSD, Harvard, and CAIDA AS. (We omit 2.25K since we 
have previously shown that 2.5K performs considerably better.) 
The results confirm that targeting c(k), in addition to achieving 
JDD, makes a big difference in terms of the NMAE for all 
graph properties considered, despite the unavoidable measurement 
errors in the estimation of the model parameters. Fig. 6 
shows plots of all considered graph properties for the New 
Orleans graph and compares the full graph, a 10% 
sample + 2K, and a 10% sample + 2.5K. We observe that with 
just a 10% sample we approximate the degree-dependent 
average clustering and the average neighbor degree extremely well. 
In addition, 2.5K gets much closer to all graph properties that 
were not targeted, with the exception of the maximal cliques. 

Fig. 6. Facebook New Orleans: distributions of nine graph properties for (i) the full graph, (ii) 10% RW sample + 2K construction, and (iii) 10% RW sample + 
2.5K construction. Results are binned in 30 intervals. 

VII. Conclusion 

Our work provides a complete and practical methodology 
for generating 2.5K graphs that resemble a real (possibly 
unknown) graph. We present novel estimators that measure 
our metrics of interest from node samples, and a novel 2.5K 
generator that targets these metrics up to orders of magnitude 
faster than prior approaches. We also make publicly available 
a Python implementation of all the building blocks at [1]. 
As an example application, one 
can apply our methodology to construct graphs that resemble 
Facebook, without having access to the full social graph, by 
simply using crawling/sampling. 